Post-training quantization (PTQ), which only requires a tiny dataset for calibration without end-to-end retraining, is a light and practical model compression technique. Recently, several PTQ schemes for vision transformers (ViTs) have been presented; unfortunately, they typically suffer from non-trivial accuracy degradation, especially in low-bit cases. In this paper, we propose RepQ-ViT, a novel PTQ framework for ViTs based on quantization scale reparameterization, to address the above issues. RepQ-ViT decouples the quantization and inference processes, where the former employs complex quantizers and the latter employs scale-reparameterized simplified quantizers. This ensures both accurate quantization and efficient inference, which distinguishes it from existing approaches that sacrifice quantization performance to meet the target hardware. More specifically, we focus on two components with extreme distributions: post-LayerNorm activations with severe inter-channel variation and post-Softmax activations with power-law features, and initially apply channel-wise quantization and log$\sqrt{2}$ quantization, respectively. Then, we reparameterize the scales to hardware-friendly layer-wise quantization and log2 quantization for inference, at only a slight cost in accuracy or computation. Extensive experiments are conducted on multiple vision tasks with different model variants, proving that RepQ-ViT, without hyperparameters and expensive reconstruction procedures, can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level.
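The core reparameterization idea, per-channel quantization scales for post-LayerNorm activations folded into the LayerNorm affine parameters and the next layer's weights so that a single layer-wise scale suffices at inference, can be illustrated with a minimal sketch (a simplification assuming symmetric quantization without zero points; the function and variable names are illustrative, not the paper's exact formulation):

```python
import numpy as np

def reparam_ln_scales(gamma, beta, W_next, s_ch):
    """Fold per-channel quantization scales of a post-LayerNorm activation
    into the LN affine parameters and the next linear layer's weights, so a
    single layer-wise scale can be used at inference.
    Simplification: symmetric quantization, no zero points."""
    s_layer = s_ch.mean()              # hardware-friendly layer-wise scale
    r = s_ch / s_layer                 # per-channel variation factors
    gamma_new = gamma / r              # rescale LN affine params channel-wise
    beta_new = beta / r
    W_new = W_next * r[None, :]        # compensate in the next layer's weights
    return gamma_new, beta_new, W_new, s_layer
```

The rescaling is exact for the full-precision computation, so any accuracy change comes only from quantizing the rescaled activations with the shared scale; the log$\sqrt{2}$-to-log2 reparameterization for post-Softmax activations follows a similar spirit.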
Image super-resolution is a common task on mobile and IoT devices, where one often needs to upscale and enhance low-resolution images and video frames. While numerous solutions have been proposed for this problem in the past, they are usually not compatible with low-power mobile NPUs having many computational and memory constraints. In this Mobile AI challenge, we address this problem and invite the participants to design an efficient quantized image super-resolution solution that can demonstrate real-time performance on mobile NPUs. The participants were provided with the DIV2K dataset and trained INT8 models to perform high-quality 3X image upscaling. The runtime of all models was evaluated on the Synaptics VS680 Smart Home board with a dedicated edge NPU capable of accelerating quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating up to 60 FPS when reconstructing Full HD resolution images. A detailed description of all models developed in the challenge is provided in this paper.
Data-free quantization can potentially address data privacy and security concerns in model compression, and thus has been widely investigated. Recently, PSAQ-ViT designed a relative value metric, patch similarity, to generate data from pre-trained vision transformers (ViTs), achieving the first attempt at data-free quantization for ViTs. In this paper, we propose PSAQ-ViT V2, a more accurate and general data-free quantization framework for ViTs, built on top of PSAQ-ViT. More specifically, following the patch similarity metric in PSAQ-ViT, we introduce an adaptive teacher-student strategy, which facilitates the continual cyclic evolution of the generated samples and the quantized model (student) in a competitive and interactive fashion under the supervision of the full-precision model (teacher), thus significantly improving the accuracy of the quantized model. Moreover, without the auxiliary category guidance, we employ task- and model-independent prior information, making the general-purpose scheme compatible with a broad range of vision tasks and models. Extensive experiments are conducted on various models for image classification, object detection, and semantic segmentation tasks, and PSAQ-ViT V2, with the naive quantization strategy and without access to real-world data, consistently achieves competitive results, showing potential as a powerful baseline for data-free quantization of ViTs. For instance, with Swin-S as the (backbone) model, 8-bit quantization reaches 82.13 top-1 accuracy on ImageNet, 50.9 box AP and 44.1 mask AP on COCO, and 47.2 mIoU on ADE20K. We hope that the accurate and general PSAQ-ViT V2 can serve as a potential and practical solution in real-world applications involving sensitive data. Code is released and merged at: https://github.com/zkkli/psaq-vit.
Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging. Quantization is a promising approach to reducing model complexity. Unfortunately, existing efforts to quantize ViTs are simulated quantization (aka fake quantization), which remains floating-point arithmetic during inference and thus contributes little to model acceleration. In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs, to enable ViTs to perform the entire computational graph of inference with integer arithmetic and bit-shifting, and without any floating-point arithmetic. In I-ViT, linear operations (e.g., MatMul and Dense) follow the integer-only pipeline with dyadic arithmetic, while non-linear operations (e.g., Softmax, GELU, and LayerNorm) are approximated by the proposed lightweight integer-only arithmetic methods. In particular, I-ViT applies the proposed Shiftmax and ShiftGELU, which are designed to use integer bit-shifting to approximate the corresponding floating-point operations. We evaluate I-ViT on various benchmark models and the results show that integer-only INT8 quantization achieves comparable (or even slightly higher) accuracy compared to the full-precision (FP) baseline. Furthermore, we utilize TVM for practical hardware deployment on the GPU's integer arithmetic units, achieving a 3.72$\sim$4.11$\times$ inference speed-up compared to the FP model.
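The flavor of the bit-shift approximations can be conveyed with a hedged sketch of a Shiftmax-like integer softmax (a simplification, not the paper's exact algorithm: the quantization scale is folded out, `F` fixed-point bits is an illustrative choice, and the final normalization would be an integer divider on hardware):

```python
import numpy as np

F = 8  # fixed-point fractional bits (illustrative choice)

def shiftmax(x_q):
    """Integer-only softmax sketch in the spirit of I-ViT's Shiftmax.
    exp(x) is replaced by 2^(x*log2(e)); x*log2(e) is approximated with
    arithmetic shifts, and 2^p is split into integer and fractional parts."""
    x_q = np.asarray(x_q, dtype=np.int64)
    x = (x_q - x_q.max(axis=-1, keepdims=True)) << F   # Q.F fixed point, all <= 0
    p = x + (x >> 1) - (x >> 4)                        # ~ x * 1.4375 ~ x * log2(e)
    q_int = p >> F                                     # integer part (floor, <= 0)
    r = p - (q_int << F)                               # fractional part in [0, 2^F)
    v = ((1 << F) + r) >> (-q_int)                     # 2^f ~ 1 + f, then shift down
    # on hardware the normalization below would be an integer divider
    return v / v.sum(axis=-1, keepdims=True)
```

All exponentiation work above is shifts and adds on integers, which is what makes the scheme amenable to integer arithmetic units.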
Vision transformers have recently gained great success on various computer vision tasks; nevertheless, their high model complexity makes deployment on resource-constrained devices challenging. Quantization is an effective approach to reduce model complexity, and data-free quantization, which can address data privacy and security concerns during model deployment, has received widespread interest. Unfortunately, all existing methods, such as BN regularization, were designed for convolutional neural networks and cannot be applied to vision transformers with significantly different model architectures. In this paper, we propose PSAQ-ViT, a Patch Similarity Aware data-free Quantization framework for vision transformers, to enable the generation of "realistic" samples based on the vision transformer's unique properties for calibrating the quantization parameters. Specifically, we analyze the self-attention module's properties and reveal a general difference (patch similarity) in its processing of Gaussian noise and real images. The above insights guide us to design a relative value metric to optimize the Gaussian noise to approximate the real images, which are then utilized to calibrate the quantization parameters. Extensive experiments and ablation studies are conducted on various benchmarks to validate the effectiveness of PSAQ-ViT, which can even outperform the real-data-driven methods.
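A minimal sketch of a patch-similarity style score (illustrative only: the paper optimizes an entropy of cosine similarities over the ViT's self-attention features by gradient ascent; here a simple histogram entropy over one token matrix conveys the idea):

```python
import numpy as np

def patch_similarity_entropy(tokens, bins=32):
    """Entropy of pairwise cosine similarities between patch tokens.
    Real images tend to produce a more diverse (higher-entropy) similarity
    pattern than Gaussian noise, so maximizing this score over a noise
    input pushes it toward image-like statistics."""
    t = tokens / np.linalg.norm(tokens, axis=-1, keepdims=True)
    sim = np.clip(t @ t.T, -1.0, 1.0)        # guard tiny float excursions past 1
    iu = np.triu_indices_from(sim, k=1)      # unique token pairs only
    hist, _ = np.histogram(sim[iu], bins=bins, range=(-1.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())     # Shannon entropy of similarities
```

In practice this kind of score would be computed on features of every layer and maximized with gradients on the input noise; the sketch shows only the metric itself.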
Brain age has been proven to be a phenotype associated with cognitive performance and brain diseases. Achieving accurate brain age prediction is an essential prerequisite for optimizing the predicted brain age difference as a biomarker. As a comprehensive biological characteristic, brain age is hard to exploit accurately with models using feature engineering and local processing, such as local convolution and recurrent operations that process one local neighborhood at a time. Instead, vision transformers learn global attentive interactions among patch tokens, introducing less inductive bias and modeling long-range dependencies. In view of this, we propose a novel network for learning brain age interpreted with global and local dependencies, where the corresponding representations are captured by a Successive Permuted Transformer (SPT) and convolution blocks. The SPT brings computational efficiency and locates 3D spatial information indirectly by continuously encoding 2D slices from different views. Finally, we collected a large cohort of 22645 subjects with ages ranging from 14 to 97, and our network performed best among a series of deep learning methods, yielding a mean absolute error (MAE) of 2.855 on the validation set and 2.911 on the independent test set.
Intelligent assistance systems can navigate blind people, but most of them can only give non-intuitive cues or inefficient guidance. Based on computer vision and vibrotactile encoding, this paper presents an interactive system that provides blind people with intuitive spatial cognition. Unlike the traditional auditory feedback strategy based on voice prompts, this paper first introduces a vibration-encoded feedback method that leverages the haptic neural pathway and enables users to interact with objects beyond merely manipulating an assistance device. Based on this strategy, a wearable vision module built around an RGB-D camera is adopted for 3D spatial object localization, which contributes to accurate perception and rapid object localization in real environments. Experimental results with blind participants show that the vibrotactile feedback reduces task completion time by 25% compared with the mainstream voice-prompt feedback scheme. The proposed object localization system provides more intuitive spatial navigation and comfortable wearability for blind assistance.
In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point clouds tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed implicitly, by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. CMT obtains 73.0% NDS on nuScenes benchmark. Moreover, CMT has a strong robustness even if the LiDAR is missing. Code will be released at https://github.com/junjie18/CMT.
Dataset distillation has emerged as a prominent technique to improve data efficiency when training machine learning models. It encapsulates the knowledge from a large dataset into a smaller synthetic dataset. A model trained on this smaller distilled dataset can attain comparable performance to a model trained on the original training dataset. However, the existing dataset distillation techniques mainly aim at achieving the best trade-off between resource usage efficiency and model utility. The security risks stemming from them have not been explored. This study performs the first backdoor attack against the models trained on the data distilled by dataset distillation models in the image domain. Concretely, we inject triggers into the synthetic data during the distillation procedure rather than during the model training stage, where all previous attacks are performed. We propose two types of backdoor attacks, namely NAIVEATTACK and DOORPING. NAIVEATTACK simply adds triggers to the raw data at the initial distillation phase, while DOORPING iteratively updates the triggers during the entire distillation procedure. We conduct extensive evaluations on multiple datasets, architectures, and dataset distillation techniques. Empirical evaluation shows that NAIVEATTACK achieves decent attack success rate (ASR) scores in some cases, while DOORPING reaches higher ASR scores (close to 1.0) in all cases. Furthermore, we conduct a comprehensive ablation study to analyze the factors that may affect the attack performance. Finally, we evaluate multiple defense mechanisms against our backdoor attacks and show that our attacks can practically circumvent these defense mechanisms.
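A hedged sketch of NAIVEATTACK-style trigger stamping on raw images before the distillation procedure begins (shapes and placement are assumptions for illustration; DOORPING's iterative trigger updates during distillation are not shown):

```python
import numpy as np

def add_trigger(images, trigger):
    """Stamp a small trigger patch onto the bottom-right corner of each raw
    image in an NCHW batch; hypothetical shapes, for illustration only."""
    out = images.copy()
    th, tw = trigger.shape[-2:]
    out[..., -th:, -tw:] = trigger   # overwrite the corner with the trigger
    return out
```

Because the poisoning happens on the inputs to distillation rather than during model training, the trigger's effect is baked into the synthetic dataset itself.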
Automatic music generation with artificial intelligence typically requires a large amount of data, which is hard to obtain for many less common genres and musical instruments. To tackle this issue, we present ongoing work and preliminary findings on the possibility for deep models to transfer knowledge from language to music, by finetuning large language models pre-trained on a massive text corpus on only hundreds of MIDI files of drum performances. We show that by doing so, one of the largest, state-of-the-art models (GPT3) is capable of generating reasonable drum grooves, while models that are not pre-trained (Transformer) show no such ability beyond naive repetition. Evaluating generated music is a challenging task, even more so for drum grooves, which have little precedent in the literature. Hence, we propose a tailored structural evaluation method and analyze drum grooves produced by GPT3 compared to those played by human professionals, exposing the strengths and weaknesses of such generation by language-to-music transfer. Our findings suggest that language-to-music transfer learning with large language models is viable and promising.